In order to analyze human behavior, a significant amount of work has been carried out in video
surveillance applications. While considering the ... costs. In this paper, we propose a
multi-observational detection and tracking approach (MoDTA) that is based on an observational filter.
MoDTA first acquires the locations of people in an image so that it can detect the conviction value at
the indicated locations, which generally increases with people density. In the tracking phase, MoDTA
computes multiple observed weight values and individual features, and an advection particle is used in
the motion model to facilitate tracking in dense scenarios. A correlation coefficient is used as the
template detector, whose function is to estimate the upcoming object. The proposed MoDTA is compared
with other existing detection and tracking methods in order to evaluate the system performance.

Keywords: Tracking
1. INTRODUCTION 

Abnormal behavior recognition and group behavior analysis have been identified as the major
difficulties in video surveillance systems, and several researchers have worked on this problem.
Considerable effort has been devoted to anomaly detection [1], which shows its importance. Subtopics
such as feature representation are therefore considered highly important for detailed description.
Feature representation forms an indispensable basis that is closely correlated with the detection
approach. Since the analysis of a crowd scene requires fundamental models of the crowd, the knowledge
acquired from crowd dynamics should be summarized before the crowd detection model is built. Although a
variety of representation models and approaches have been proposed, there is still no generally accepted
solution for the analysis of crowded scenes. Here we consider the detection of anomalous behavior in videos.

The explicit event based approach uses a supervised model [2], where the abnormality of behavior is
learnt from a training set. The problem with this type of approach is that the detection of abnormality
generally depends on the training dataset (i.e., a previously collected dataset). An approach based on
manually specifying a particular abnormality has been developed for specific applications, such as
detecting threats in a 'cargo' video observation system [3]. Unsupervised methods (i.e., methods that do
not depend on any prior knowledge or training) can also identify abnormal behaviors. However, this
category generally detects only simple abnormal scenarios, such as bicycles and cars among pedestrians.
These methods generally analyze the acceleration and optical flow, which are very different among these
objects [4].

A generalized trajectory based approach first sets the scene of interest for the crowd and then splits
it into different objects. The objects are then monitored through their behavior in the video sequences,
which forms the extracted feature trajectories [5]. Zone based approaches, deep learning based models [6],
temporal spatial path search [7], string kernel clustering [8] and single-class SVM (Support Vector
Machine) [9] are applied to compute anomalies in trajectories. The choice of methodology depends on the
nature of the objects and the kind of people being tracked. Tracking performance is degraded by rapid
motion and low video resolution. The aim of the global pattern based approach is not only to track and
detect individuals in a scene, but also to extract low/medium level features from the video scene and to
analyze these features for the whole entity [10]. In general, these features can be based on optical flow
and spatial temporal gradients. Moreover, some approaches are quite effective when dealing with group
activity, such as the motion influence map [11], stationary map [12], Gaussian regression [13], global
motion map [14], salient motion map [15], energy model [16], PCA model [17], Gaussian mixture model [18]
and social force model [19].

Tracking and detecting crowd abnormality is very challenging because of the randomly changing crowd
dynamics, and prediction trackers are not modelled to differentiate individual scenarios. In this paper,
we propose a multi-observational detection and tracking approach (MoDTA) that is based on an
observational filter. MoDTA first acquires the locations of people in an image and detects the conviction
value at the indicated locations, which generally increases with people density. In the tracking phase,
MoDTA computes multiple observed weight values, and individual features are added to the scenarios. An
advection particle is also used as the motion model in order to facilitate tracking in dense scenarios.
The function of the template detector is to estimate the upcoming object; a correlation coefficient is
used as the template detector. We use the UMN dataset for detecting abnormality in a given scene, and the
proposed MoDTA is compared with other existing detection and tracking methods.


2. LITERATURE SURVEY 

In recent years, automatic scene analysis and understanding has attracted a lot of research attention
in the computer vision community [20], and its major application is intelligent surveillance, which has
replaced the traditional way of video surveillance. Although several approaches have been developed for
recognizing, tracking and understanding the behavior of various objects in a video scene, they are mainly
modelled for common scenes with low population density. When crowded scenes are analyzed, difficulty
arises from the large number of objects involved, which causes failures in both detection and tracking
and thereby increases the computational complexity. Groups [21] are the major entities that make up a
crowd; therefore, understanding group-level properties is very important.

The work in [22] addresses the modelling problem of classifying captured surveillance video into
normal and abnormal. A new framework is developed in order to provide anomaly detection and automatic
behavior profiling, which depends on group clustering. Behavior analysis can be applied efficiently to
crowded scenes with random distribution and varying crowd density, which is potentially necessary in
several applications such as abnormal event detection, crowd video classification and crowd dynamic
monitoring in security surveillance. Event based abnormality detection is the procedure of identifying
an abnormal scenario by comparing it with the bulk of usual events; the major challenges are the dynamic
variations in scenes and the large structured redundancy in surveillance videos. To overcome these
problems, the authors of [23] proposed a framework for abnormality detection and localization in video
scenes that depends on locality-constrained affine subspace coding (LASC) and an updating approach. In
that study, LASC is used to reconstruct a sample through its top-k nearest subspaces, which are obtained
by segmenting the usual sample space with a clustering approach. A sample with a large reconstruction
cost is detected by setting a threshold value. The authors of [24] introduced a model and an algorithm,
with a case study to validate them, to optimize pattern recognition, event detection and abnormality
identification in real-life video surveillance.

That work observes human behavior patterns, which change continuously and adapt over time rather than
being static. Some limitations are identified in existing works, and a dynamic clustering algorithm is
used in [25] to overcome these drawbacks. Accordingly, the authors proposed maintaining two different
datasets in parallel, an abnormal plane and a normal plane, in order to accomplish the learning task.
Moreover, the practicability of the model was demonstrated on real-life cases. Computing abnormality for
crowded scenes is becoming difficult and critical in internet services and cloud environments because of
end-to-end user experience and quality of service. However, the large changes in the behavior of metric
streams have increased the challenges for frameworks that rely on thresholds with normal or stationary
assumptions. Complex models generally demand extensive offline training. These approaches are prone to
spurious false alarms in online settings, since the metric streams experience quick contextual changes
relative to a known reference point. In [26], a property of the underlying temporal stream is first
predicted through adaptive learning, and robust statistical control charts are then applied in order to
identify deviations.

To handle obstruction and complex scenes, several approaches use features extracted from motion cues
and low-level appearance. Optical flow and texture flow are used for object tracking. Frequently used
features include the histogram of oriented gradients, the histogram of optical flow and 3D
'spatial-temporal features'. Spatial-temporal features are used to build a Markov field model, and
normal behavior statistics are used to detect abnormal crowd behavior. In [27], a multi-scale histogram
of optical flow feature is employed together with a sparse representation, and detection is performed
through the reconstruction cost.

In [28], a novel approach is proposed that uses a Markov random field and local optical flow to detect
unusual events. In [29], a methodology is developed to determine unusual behavior by using statistical
aggregates. An enhanced model for determining unusual behavior through texture, size and foreground is
proposed in [30]. To describe the spatial-temporal size, the behavior is represented sparsely in order to
identify abnormal activity. A method for unusual behavior detection based on a Markov random field (MRF)
employs optical flow acceleration and the histogram of optical flow as behavioral features [31]. In [32],
an integrated model of motion cues and appearance is used. In [33], a novel descriptor is proposed for
abnormality detection by enhancing the acceleration concept with hybrid optical flow. However, these
adopted features are crafted manually, which is a major drawback for anomaly detection: they require
prior information, are hard to extract from complex video scenes and involve huge computational costs.


3. MULTI-OBSERVATIONAL DETECTION AND TRACKING APPROACH (MODTA) 
3.1. Density Informed Energy Formulation 

Here, we assume a conviction value $c(b_i)$ for detecting a person at location $b_i$. In an image, the
locations are indexed by $i = 1, \dots, A$. In order to compute the person density, the number of people
per pixel is first estimated. The estimate $E(b_i)$ is computed in a window of size $d$ at location
$b_i$. The major aim is to obtain the locations of people in an image so that the conviction value is
detected at the indicated locations. The conviction value increases with the people density given by
$E$, and inappropriate overlapping detections are prevented. The detection process is encoded in the
image by an A-vector $x \in \{0,1\}^{A}$. When $x_i$ is one, the detection at $b_i$ is switched on;
otherwise $x_i$ is zero. The detection problem can then be formulated as a minimization. Decreasing the
density error ($F_{ED}$) means minimizing the difference between the density of the active detections
and the density obtained with the $E(b)$ estimator. The minimization of the cost function can be
written as;


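$$\min_{x \in \{0,1\}^{A}} \; \underbrace{\lVert E - Gx \rVert^{2}}_{F_{ED}} + \underbrace{x^{T} K x}_{F_{P}} \; \underbrace{-\, c^{T} x}_{F_{S}} \qquad (1)$$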


In (1), $F_S$ favors locations where the person detector gives high certitude values, which are
indicated by $x_i = 1$. $F_P$ enforces a valid configuration by selecting non-overlapping detections;
this is obtained by setting $K_{ij} = \infty$ when the detections at locations $b_i$ and $b_j$ have a
large area overlap ratio, and $K_{ij} = 0$ otherwise.

The $F_{ED}$ term in (1) models the crowd density constraint by correcting the gap between two density
values, the first computed with the density based regression estimator $E$ and the second computed from
the detections switched on or off in $x$. To evaluate the density of the active detections in $x$, a
matrix multiplication is performed, where $G$ is the $A \times A$ matrix with rows $G_i$;





$$G_i(l_j) = \frac{1}{2\pi d^{2}} \exp\!\left(-\frac{\lVert b_i - l_j \rVert^{2}}{2 d^{2}}\right) \qquad (2)$$

where $G_i$ is a Gaussian window of size $d$ centered at position $b_i$ and evaluated at image location
$l_j$. Optimizing the cost $F_{ED}$ in (1) improves person detection by correcting the certitude values
in lower density image regions.

Put more simply, if $d$ is the same over the whole image width, the term $F_{ED}$ reduces, up to a
multiplicative constant, to the gap between the number of detections switched on in $x$ and the total
number of people that the density estimator obtains for the considered image.
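
The cost in (1) can be illustrated with a minimal Python sketch. It assumes the three terms take the form of a squared density gap, a pairwise overlap penalty and a negative detection score; the toy scene, the helper names (`gaussian_rows`, `energy`, `best_configuration`), the finite penalty `BIG` used in place of an infinite $K_{ij}$, and the exhaustive search are all illustrative assumptions rather than the solver used in this work.

```python
from itertools import product

import numpy as np


def gaussian_rows(locations, d):
    """Row i is a 2-D Gaussian of size d centred at b_i, sampled at every location l_j."""
    diff = locations[:, None, :] - locations[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * d ** 2)) / (2.0 * np.pi * d ** 2)


def energy(x, c, K, G, E):
    """Density-error, pairwise and score terms for a binary detection vector x."""
    f_ed = float(np.sum((E - G.T @ x) ** 2))  # gap between the two density values
    f_p = float(x @ K @ x)                    # penalty for overlapping detections
    f_s = -float(c @ x)                       # reward for confident detections
    return f_ed + f_p + f_s


def best_configuration(c, K, G, E):
    """Exhaustive search over {0,1}^A; only feasible for a handful of candidates."""
    candidates = (np.array(bits, dtype=float)
                  for bits in product((0, 1), repeat=len(c)))
    return min(candidates, key=lambda x: energy(x, c, K, G, E))


# Hypothetical toy scene: candidates 0 and 1 overlap, candidate 2 stands apart.
locations = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
c = np.array([0.9, 0.8, 0.7])                 # conviction values c(b_i)
BIG = 1e6                                     # finite stand-in for K_ij = infinity
K = np.zeros((3, 3))
K[0, 1] = K[1, 0] = BIG
G = gaussian_rows(locations, d=1.0)
E = G.T @ np.array([1.0, 0.0, 1.0])           # density map produced by two people

best = best_configuration(c, K, G, E)
print(best)
```

In this toy scene the search keeps the stronger of the two overlapping candidates together with the isolated one, because switching on both overlapping candidates triggers the pairwise penalty and any other choice leaves a density gap.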

We now propose a multiscale detector for individual person detection, which produces the dense
detection score map $c(b)$ used in the cost equation. The function $c(b)$ is integrated with the density
computation $E(b)$. It also requires geometric computation for crowd scene footage, which generally
involves several people moving on a filmed plane, in order to obtain the plausible positions of people.


3.2. Detecting and Tracking Approach 

In this section, we present the detection and tracking model, in which a multi-observational process
is performed. It consists of in-scenario ($z_t^{s-in}$) and out-scenario ($z_t^{s-out}$) observations
together with a similarity measurement ($Z_t$), which helps to compute person activity in split and merge
cases. The major aim is to use a multi-observational process that decomposes the typical observation
model of the filter into split observations. The interactions and similarities of individual persons are
therefore observed, and can be written as;


$$B(Z_t \mid x_t) = b(z_t^{i}, z_t^{p} \mid x_t) \qquad (3)$$
$$B(Z_t \mid x_t) = b(z_t^{i} \mid z_t^{p}, x_t)\, b(z_t^{p} \mid x_t) \qquad (4)$$

where $z_t^{i}$ is the observation of a person's interaction with other people and $z_t^{p}$ is the
individual similarity observation. Given the independence of interaction and similarity, this can be
written as;

$$B(Z_t \mid x_t) = b(z_t^{i} \mid x_t)\, b(z_t^{p} \mid x_t) \qquad (5)$$

Therefore, the multi-observations split the observed interactions into in-scenario ($z_t^{s-in}$) and
out-scenario ($z_t^{s-out}$) observations, and the interaction is evaluated over the in and out sets of
scenarios;

$$b(z_t^{i} \mid x_t) = b(z_t^{s-in}, z_t^{s-out} \mid x_t) \qquad (6)$$
$$b(z_t^{i} \mid x_t) = b(z_t^{s-in} \mid z_t^{s-out}, x_t)\, b(z_t^{s-out} \mid x_t) \qquad (7)$$
$$b(z_t^{i} \mid x_t) = b(z_t^{s-in} \mid x_t)\, b(z_t^{s-out} \mid x_t) \qquad (8)$$

Hence the observed model can be written as;

$$B(Z_t \mid x_t) = b(z_t^{s-in} \mid x_t)\, b(z_t^{s-out} \mid x_t)\, b(z_t^{p} \mid x_t) \qquad (9)$$


The in-scenario observation is denoted by $b(z_t^{s-in} \mid x_t)$ and measures the degree of belonging
to that scenario; the out-scenario observation is denoted by $b(z_t^{s-out} \mid x_t)$ and measures the
degree of not belonging to that scenario. Here, $m_{xz}^{\theta}$ denotes the directional similarity and
$m_{xz}^{sd}$ denotes the normalized spatial distance.


$$m_{xz}^{\theta} = \frac{1 + \cos\theta}{2} \qquad (10)$$
mo = a mingep(%40)) (11) 


where $sd$ denotes the distance between $x$ and $z$, and $\theta$ is the angle difference between their
motion vectors. The set of bounding boxes of the individual detection results in the first frame is
denoted by $p$, and the bounding-box height $q_h$ and width $q_w$ in the set $p$ are used for
normalization. The interaction weight can therefore be calculated as;


$$m_{xz} = m_{xz}^{\theta}\, m_{xz}^{sd} \qquad (12)$$
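
As a rough illustration of the interaction weight, the sketch below multiplies a directional similarity by a spatial term. It assumes the directional similarity takes the common form $(1 + \cos\theta)/2$ and treats the spatial term as a precomputed input; the function names and the neutral value returned for a zero-length motion vector are assumptions made for this example.

```python
import math


def directional_similarity(v_x, v_z):
    """(1 + cos(theta)) / 2, where theta is the angle between two 2-D motion vectors."""
    norm = math.hypot(v_x[0], v_x[1]) * math.hypot(v_z[0], v_z[1])
    if norm == 0.0:
        return 0.5  # assumption: an undefined direction is treated as neutral
    cos_theta = (v_x[0] * v_z[0] + v_x[1] * v_z[1]) / norm
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return 0.5 * (1.0 + cos_theta)


def interaction_weight(v_x, v_z, spatial_term):
    """Product of the directional similarity and a precomputed spatial term."""
    return directional_similarity(v_x, v_z) * spatial_term


same_direction = directional_similarity((1.0, 0.0), (2.0, 0.0))
opposite_direction = directional_similarity((1.0, 0.0), (-1.0, 0.0))
weight = interaction_weight((1.0, 0.0), (0.0, 1.0), spatial_term=0.4)
print(same_direction, opposite_direction, weight)
```

Two people moving in the same direction receive the full spatial term as their weight, while people moving in opposite directions receive a weight of zero.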

Here, $B^{s-out}$ and $B^{s-in}$ describe the number of persons in the out-scenario and in-scenario
respectively; the out-scenario measure $m_{xz}^{s-out}$ and the in-scenario measure $m_{xz}^{s-in}$ are
computed between the individuals $x$ and $z$. It can be shown as;


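$$m_{xz}^{r} = 0, \quad \text{if } m_{xz}^{\theta} < m^{\theta} \text{ or } m_{xz}^{sd} < m^{sd} \qquad (13)$$


Otherwise it can be given as;


$$m_{xz}^{r} = m_{xz}^{\theta}\, m_{xz}^{sd} \qquad (14)$$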


where $m^{\theta}$ and $m^{sd}$ are the thresholds for directional similarity and spatial closeness, and
$r \in \{s-in, s-out\}$. The value of $m_{xz}^{s-in}$ for an individual is computed with respect to the
persons of $B^{s-in}$, and $m_{xz}^{s-out}$ is computed with respect to the persons of $B^{s-out}$.

Several individuals are linked to others only indirectly in the in and out scenarios. The weights
therefore need to be filtered in order to provide indirect connection information. For this reason, we
update the weights when $m_{xz}^{r}$ becomes zero but, for some individual $s$, the following holds;


$$m_{xs}^{r} > m^{\theta} m^{sd} \quad \text{and} \quad m_{sz}^{r} > m^{\theta} m^{sd} \qquad (15)$$


In this case, we set $m_{xz}^{r} = m_{xs}^{r}$, because $x$ is linked to $z$ via $s$. After the
refinement of the individual weights, the possible event $u$ is determined by;


$$U(x, z, s) = \begin{cases} P_{xz}, & m_{sz}^{r} = 0 \\ V_{xz}, & m_{sz}^{r} > 0 \end{cases} \qquad (16)$$


If the above condition is not matched, the value of $U(x, z, s)$ is none, where $z \in B^{s-in}$ and
$s \in B^{s-in}$. $P_{xz}$ is a function that detects split activity between $x$ and $z$, and $V_{xz}$
is a function that detects merge activity between $x$ and $z$. Moreover, the Euclidean distance is used
to compute the individual connection $b(z_t^{p} \mid x_t)$.

Moreover, the motion model is formed over an object state that is composed of position (i.e., $b_x$ and
$b_y$) and velocity (i.e., $w_x$ and $w_y$);


$$x_t = [\, b_x \;\; b_y \;\; w_x \;\; w_y \,] \qquad (17)$$

$$T(x_{t+1} \mid x_{0:t}) = (1 - \beta)\, \pi_{Pa}(x_{t+1} \mid x_{0:t}) + \beta\, \pi_{Det}(x_{t+1} \mid x_t) \qquad (18)$$

where $\beta$ determines the relative weight of the components; it is selected inversely proportional to
the group density so that particle advection receives a higher weight in denser regions. The hypothesis
parts $\pi_{Pa}(x_{t+1} \mid x_t)$ and $\pi_{Det}(x_{t+1} \mid x_t)$ produce motion vectors by particle
advection and by the template detector respectively, and a higher weight is given to the template
detector in sparse regions. The updated group density is given by;


Psa ="? Ig, (19) 


where $l_g$ denotes the number of people present in the group scenario and $U_o$ denotes an outsider who
does not intersect with the group scenario. Then $\beta$ can be written as;


B=°/o.4 (20) 


Here, $\delta$ denotes the similarity measurement obtained between the detected template and the objects.
The model of linear motion dynamics with Gaussian noise is given as;


$$x_{t+1} = \Phi\, x_t + \varepsilon_t \qquad (21)$$

where $\Phi$ denotes the matrix of particle dynamics, $\varepsilon_t$ is Gaussian noise and $x_t$ is the
group state at time $t$. The velocity vector of every tracked object is collected over several frames,
and the similarity of the current velocity vector is computed based on the earlier motion of the tracked
object. We select the objects with a consistent velocity vector, and $\pi_{Pa}(x_{t+1} \mid x_t)$ is then
used to compute the velocity vector. The template detector function $\pi_{Det}(x_{t+1} \mid x_t)$ is used
to compute the estimate of $[\, w_x \;\; w_y \,]$ for the upcoming object by detection; a correlation
coefficient is used as the template detector.
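
The motion model described above can be sketched in Python as follows. This is a simplified, deterministic illustration rather than the particle filter itself: it assumes a constant-velocity dynamics matrix (`PHI`), blends the two velocity hypotheses with the weight `beta`, and scores a template match with the Pearson correlation coefficient. All function names and numeric values are hypothetical.

```python
import numpy as np

# Assumed constant-velocity dynamics for the state [b_x, b_y, w_x, w_y].
PHI = np.array([[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])


def correlation_coefficient(template, patch):
    """Pearson correlation between a template and an equally sized image patch."""
    t = np.asarray(template, dtype=float).ravel()
    p = np.asarray(patch, dtype=float).ravel()
    t = t - t.mean()
    p = p - p.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(p)
    return float(t @ p / denom) if denom > 0 else 0.0


def blend_velocity(v_advection, v_detector, beta):
    """(1 - beta) * particle-advection velocity + beta * template-detector velocity."""
    v_a = np.asarray(v_advection, dtype=float)
    v_d = np.asarray(v_detector, dtype=float)
    return (1.0 - beta) * v_a + beta * v_d


def predict_state(state, velocity, noise_std, rng):
    """One linear step: replace the velocity, apply PHI, then add Gaussian noise."""
    s = np.asarray(state, dtype=float).copy()
    s[2:] = velocity
    return PHI @ s + rng.normal(0.0, noise_std, size=4)


rng = np.random.default_rng(0)
velocity = blend_velocity((2.0, 0.0), (0.0, 4.0), beta=0.25)
next_state = predict_state([0.0, 0.0, 9.0, 9.0], velocity, noise_std=0.0, rng=rng)
patch = np.array([[1.0, 2.0], [3.0, 4.0]])
score_same = correlation_coefficient(patch, patch)
score_inverted = correlation_coefficient(patch, 10.0 - patch)
print(velocity, next_state, score_same, score_inverted)
```

A small `beta`, as would be chosen in a dense region, keeps the blended velocity close to the particle-advection hypothesis, while a `beta` near one follows the template detector.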


4. RESULT ANALYSIS 

In this section, the results are simulated using Matlab-2016b on a system with an Intel i5 processor,
8 GB RAM and a 2 GB NVIDIA graphics card, running the Windows 10 operating system. We consider a very
popular dataset for video surveillance research, the UMN dataset [34], which is originally in avi format.
It consists of four different video datasets: courtyard, crowd, corridor and hit-run. Each dataset has
thirty frames per second, of which we take only 2 frames per second, which increases the chances of
usability in a real-time scenario. Each dataset contains normal frames and abnormal frames, where normal
frames show normal crowd activity and abnormal frames show abnormal crowd activity. To measure the
effectiveness of the proposed model, we use the Ground Truth as the reference.
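
The frame sub-sampling can be sketched as below, assuming the retained frames are evenly spaced so that every 15th frame of the 30 frames-per-second video is kept; the function name and the even-spacing choice are assumptions, since the exact selection rule is not specified here.

```python
def subsample_indices(n_frames, source_fps=30, target_fps=2):
    """Indices of the frames kept when reducing source_fps to target_fps."""
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in the range 1..source_fps")
    step = source_fps // target_fps  # 30 // 2 = 15, so every 15th frame is kept
    return list(range(0, n_frames, step))


kept = subsample_indices(60)  # two seconds of video
print(kept)
```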

Figure 1 shows the normal crowd activity at dataset 1 in a courtyard scenario and Figure 2 shows the
abnormal crowd activity at dataset 1 in a courtyard scenario. Figure 3 shows the comparison with respect
to Ground Truth (GT) at the courtyard dataset, where the GT shows that the abnormal frames (AF) start
from the 37th frame and our MoDTA model detects them from the 38th frame onwards.





Figure 1. Normal crowd activity at dataset 1 (Courtyard)
Figure 2. Abnormal crowd activity at dataset 1 (Courtyard)







Figure 3. Comparison w.r.t ground truth at courtyard dataset 


Figure 4 shows the normal crowd activity at dataset 2 and Figure 5 shows the abnormal crowd activity at
dataset 2. Figure 6 shows the comparison with respect to Ground Truth (GT) at the crowd dataset, where
the GT shows that the abnormal frames (AF) start from the 32nd frame and our MoDTA model detects them
from the 33rd frame. In the GT, frames 1 to 31 are normal frames (NF) and frames 32 to 40 are abnormal
frames. MoDTA detects frames 1 to 32 as normal and the rest, i.e. 33 to 40, as abnormal. Figure 7 shows
the normal crowd activity at dataset 3 in a corridor scenario and Figure 8 shows the abnormal crowd
activity at dataset 3 in a corridor scenario.



Figure 4. Normal crowd activity at dataset 2 (Crowd)
Figure 5. Abnormal crowd activity at dataset 2 (Crowd)





Figure 7. Normal crowd activity at dataset 3 (Corridor)
Figure 8. Abnormal crowd activity at dataset 3 (Corridor)

Figure 9 shows the comparison with respect to GT at the corridor dataset. Here the GT shows that the
abnormal frames (AF) start from the 21st frame and our MoDTA model detects them from the 23rd frame. In
the GT, frames 1 to 20 are normal frames (NF) and frames 21 to 34 are abnormal frames. MoDTA detects
frames 1 to 22 as normal and the rest, 23 to 34, as abnormal.

Figure 10 shows the normal crowd activity at dataset 4 in a hit-run scenario and Figure 11 shows the
abnormal crowd activity at dataset 4 in a hit-run scenario. Figure 12 shows the comparison with respect
to GT at the hit-run dataset, where the GT shows that the AF start from the 13th frame and our MoDTA
model detects them from the 15th frame. In the GT, frames 1 to 12 are normal frames (NF) and frames 13 to
30 are abnormal frames. MoDTA detects frames 1 to 14 as normal and the rest, 15 to 30, as abnormal.
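
The frame ranges reported above can be turned into a simple frame-level agreement count. The sketch below is only a worked check of those ranges, using the onset frames stated in the text; the function names are illustrative, and this count is not the evaluation measure reported in this paper.

```python
def labels_from_onset(n_frames, onset):
    """Frames are numbered from 1; frames at or after `onset` are abnormal (1)."""
    return [1 if frame >= onset else 0 for frame in range(1, n_frames + 1)]


def agreement(n_frames, gt_onset, detected_onset):
    """Number of frames on which the detection and the ground truth agree."""
    gt = labels_from_onset(n_frames, gt_onset)
    detected = labels_from_onset(n_frames, detected_onset)
    return sum(g == d for g, d in zip(gt, detected))


# (total frames, GT abnormal onset, MoDTA abnormal onset) as stated in the text.
sequences = {
    "crowd": (40, 32, 33),
    "corridor": (34, 21, 23),
    "hit-run": (30, 13, 15),
}
matches = {name: agreement(*spec) for name, spec in sequences.items()}
print(matches)
```

With these ranges the detection lags the ground truth by one frame on the crowd sequence (39 of 40 frames agree) and by two frames on the corridor and hit-run sequences (32 of 34 and 28 of 30 frames agree).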

The receiver operating characteristic (ROC) curve is one of the most common techniques to visualize the
performance of a classifier. Here, we have used the BoF classifier as a binary classifier, because we
have only two classes: a crowd scenario is classified as either normal activity or abnormal activity.
The ROC curves for the different UMN datasets are shown in Figures 13, 14, 15 and 16. Figure 17 shows
the comparison with different existing techniques in terms of Area Under Curve (AUC %). Our model
achieved 99.62% AUC, which is 4.7% higher than the existing SFM [35]. Moreover, compared with Chaotic
invariants [36], Sparse reconstruction [37], Local statistics [38] and MDT [39], our model achieved
0.22%, 0.02%, 0.12% and 0.12% higher AUC, respectively.
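
For reference, the AUC of a binary classifier can be computed directly from per-frame scores as the fraction of (abnormal, normal) frame pairs that are ranked correctly. The sketch below uses this rank-based definition with hypothetical scores; it does not reproduce the BoF classifier or the reported results.

```python
def roc_auc(scores, labels):
    """Probability that a random abnormal frame outscores a random normal frame.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    abnormal = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not abnormal or not normal:
        raise ValueError("both classes must be present")
    wins = sum((a > n) + 0.5 * (a == n) for a in abnormal for n in normal)
    return wins / (len(abnormal) * len(normal))


# Hypothetical abnormality scores for four frames (label 1 = abnormal frame).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
print(toy_auc)
```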





Figure 9. Comparison w.r.t ground truth at corridor dataset 





Figure 10. Normal crowd activity at dataset 4 (Hit-run)
Figure 11. Abnormal crowd activity at dataset 4 (Hit-run)

Figure 12. Comparison w.r.t ground truth at hit-run dataset

Figure 15. ROC plot for Dataset-3
Figure 16. ROC plot for Dataset-4

Figure 17. Comparison w.r.t different existing technique


5. CONCLUSION 

The behavior of crowded scenes is difficult to understand from a machine point of view, and the huge
diversity and inherent complexity within a frame make it even more problematic. It is especially
difficult to understand the different behaviors of a crowd when the crowd dynamics and crowd context
change over time. In this paper, we have proposed MoDTA, which is based on an observational filter. MoDTA
first acquires the locations of people in an image so that it detects the conviction value at the
indicated locations. MoDTA computes multiple observed weight values and individual features, so that
abnormal and normal behavior can be predicted in a given frame. The proposed approach is applied to the
different sets of the UMN dataset. In the result analysis, we used the BoF classification model to obtain
the AUC with respect to GT; our model achieved 99.62% AUC, which is 1.01% higher than LMVD [40] and
indicates the significance of the model performance.